Eligibility Traces for Off-Policy Policy Evaluation
Abstract
Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods and 2) to empirically rank the new methods, showing improvement over one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods and to offline updating (at the end of each episode), although several of the algorithms could be applied more generally.

In reinforcement learning, we generally learn from experience, that is, from the sequence of states, actions, and rewards generated by the agent interacting with its environment. This data is affected by the decision-making policy used by the agent to select its actions, and thus we often end up learning something that is a function of the agent's policy. For example, the common subproblem of policy evaluation is to learn the value function for the agent's policy (the function giving the expected future reward available from each state–action pair). In general, however, we might want to learn about policies other than that currently followed by the agent, a process known as off-policy learning. For example, 1-step Q-learning is often used in an off-policy manner, learning about the greedy policy while the data is generated by a slightly randomized policy that ensures exploration.

Off-policy learning is especially important for research on the use of temporally extended actions in reinforcement learning (Kaelbling, 1993; Singh, 1992; Parr, 1998; Dietterich, 1998; Sutton, Precup & Singh, 1999). In this case, we are interested in learning about many different policies, each corresponding to a different macro-action, subgoal, or option. Off-policy learning enables the agent to use its experience to learn about the values and models of all the policies in parallel, even though it can follow only one policy at a time (Sutton, Precup & Singh, 1998).

In this paper we consider the natural generalization of the policy evaluation problem to the off-policy case. That is, we consider two stationary Markov policies, one used to generate the data, called the behavior policy, and one whose value function we seek to learn, called the target policy. The two policies are completely arbitrary except that the behavior policy must be soft, meaning that it must have a non-zero probability of selecting every action in each state. (The last method we consider has weaker requirements, not even requiring that the behavior policy be stationary, only non-starving.) This policy evaluation problem is a particularly clear and pure case of off-policy learning.
Whatever we learn about it we expect to elucidate, if not directly transfer to, the problem of learning value functions and models of temporally extended macro-actions.

There are few existing model-free algorithms that apply to off-policy policy evaluation. (In this paper we restrict attention to methods that learn directly from experience rather than forming an explicit model of the environment. Such model-free methods have been emphasized in reinforcement learning because of their simplicity and robustness to modeling errors and assumptions.) There is a natural one-step method, TD(0), but the more general TD(λ), for λ > 0, fails because it includes some effect of multi-step transitions, which are contaminated by the behavior policy and not compensated for in any way. The only prior method we know of that uses multi-step transitions appropriately is the weighted Monte Carlo method described briefly by Sutton and Barto (1998).

There are at least three variations of Q-learning which use eligibility traces: Watkins's Q(λ) (Watkins, 1989), Peng's Q(λ) (Peng & Williams, 1996), and naive Q(λ) (Sutton & Barto, 1998). Like 1-step Q-learning, these are all off-policy methods, but they apply only to the special case in which the target policy is deterministic and changing (to always be greedy with respect to the current value function estimate). These methods cannot be applied directly to our simpler but more general policy evaluation problem, although two of our four new methods reduce to Watkins's Q(λ) in the special case in which the target policy is deterministic.

1. Reinforcement Learning (MDP) Notation

In this paper we consider the episodic framework, in which the agent interacts with its environment in a sequence of episodes, numbered $m = 1, 2, 3, \ldots$, each of which consists of a finite number of time steps, $t = 0, 1, 2, \ldots, T_m$. The first state of each episode, $s_0 \in S$, is chosen according to some fixed distribution. Then, at each step $t$, the agent perceives the state of the environment, $s_t \in S$, and on that basis chooses an action, $a_t \in A$. In response to $a_t$, the environment produces, one step later, a numerical reward, $r_{t+1} \in \mathbb{R}$, and a next state, $s_{t+1}$. If the next state is the special terminal state, then the episode terminates at time $T_m = t + 1$. We assume here that $S$ and $A$ are finite and that the environment is completely characterized by one-step state-transition probabilities, $p^a_{ss'}$, and one-step expected rewards, $r^a_s$, for all $s, s' \in S$ and $a \in A$.

A stationary way in which the agent might behave, or policy, is specified by a mapping from states to probabilities of taking each action: $\pi : S \times A \to [0,1]$. The value of taking action $a$ in state $s$ under policy $\pi$, denoted $Q^\pi(s,a)$, is the expected discounted future reward starting in $s$, taking $a$, and henceforth following $\pi$:

$$Q^\pi(s,a) \stackrel{\mathrm{def}}{=} E_\pi\{\, r_1 + \gamma r_2 + \cdots + \gamma^{T-1} r_T \mid s_0 = s,\ a_0 = a \,\},$$

where $0 \le \gamma \le 1$ is a discount-rate parameter and $T$ is the time of termination. The function $Q^\pi : S \times A \to \mathbb{R}$ is known as the action-value function for policy $\pi$. The problem we consider in this paper is that of estimating $Q^\pi$ for an arbitrary target policy $\pi$, given that all data is generated by a different behavior policy $b$, where $b$ is soft, meaning $b(s,a) > 0$ for all $s \in S$ and $a \in A$.
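To make this definition concrete, here is a minimal Python sketch, not taken from the paper: given episodes that all start from a fixed state–action pair $(s, a)$ and then follow the target policy $\pi$, it computes each episode's discounted return $r_1 + \gamma r_2 + \cdots + \gamma^{T-1} r_T$ and averages them, which is the on-policy Monte Carlo estimate of $Q^\pi(s,a)$. The container format, function names, and discount-rate value are illustrative assumptions; the harder case, in which the episodes are generated by the behavior policy $b$ rather than by $\pi$, is the subject of the rest of the paper.

```python
from typing import List

GAMMA = 0.9  # discount-rate parameter; an illustrative value, not from the paper


def discounted_return(rewards: List[float], gamma: float = GAMMA) -> float:
    """Return r_1 + gamma*r_2 + ... + gamma^(T-1)*r_T for one recorded episode."""
    g = 0.0
    for reward in reversed(rewards):  # accumulate from the end of the episode
        g = reward + gamma * g
    return g


def mc_estimate(episode_rewards: List[List[float]], gamma: float = GAMMA) -> float:
    """Average the returns of episodes that start with (s, a) and then follow pi.

    The sample mean of the returns estimates Q^pi(s, a) when the data are
    generated by pi itself (the on-policy case).
    """
    returns = [discounted_return(rs, gamma) for rs in episode_rewards]
    return sum(returns) / len(returns)


# Two hypothetical episodes, each given as its reward sequence r_1, ..., r_T.
print(mc_estimate([[1.0, 0.0, 2.0], [0.5, 1.5]]))
```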
2. Importance Sampling Algorithms

One way of viewing the special difficulty of off-policy learning is that it is a mismatch of distributions: we would like data drawn from the distribution of the target policy, but all we have is data drawn from the distribution of the behavior policy. Importance sampling (e.g., see Rubinstein, 1981) is a classical technique for handling just this kind of mismatch. In particular, it is a method for estimating the expected value of a random variable $x$ with distribution $d$ from samples, when the samples are drawn from another distribution $d'$. For example, the target distribution $d$ could be normal, while the sampling distribution $d'$ is uniform.
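As a concrete illustration of that case, here is a minimal Python sketch, not taken from the paper: it estimates the mean of a normal target distribution $d$ using only samples drawn from a uniform sampling distribution $d'$, reweighting each sample by the ratio $d(x)/d'(x)$. The particular distributions, parameters, and the weighted variant at the end are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution d: Normal(mu, sigma).  Sampling distribution d': Uniform(low, high).
# These numbers are illustrative; nearly all of d's mass lies inside [low, high],
# so the uniform distribution covers the target's support well enough.
mu, sigma = 1.0, 0.5
low, high = -2.0, 4.0

def d(x):
    """Density of the target (normal) distribution."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def d_prime(x):
    """Density of the sampling (uniform) distribution."""
    return np.full_like(x, 1.0 / (high - low))

# Draw samples from d', NOT from d -- the analogue of data generated by the behavior policy.
x = rng.uniform(low, high, size=100_000)
w = d(x) / d_prime(x)                    # importance-sampling ratios

ordinary_is = np.mean(w * x)             # unbiased estimate of E_d[x]
weighted_is = np.sum(w * x) / np.sum(w)  # biased but consistent, usually lower variance

print(ordinary_is, weighted_is)          # both should be close to mu = 1.0
```

The ordinary estimate divides the ratio-weighted sum by the number of samples and is unbiased as long as $d'$ gives non-zero probability wherever $d$ does (the analogue of the soft behavior policy requirement above); the weighted estimate normalizes by the sum of the ratios instead, accepting some bias in exchange for lower variance. The same trade-off reappears when these ideas are applied to the returns of full episodes in off-policy policy evaluation.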